DSCI Capstone Project Prep

Overview

Background

Data

I am using the microarray gene expression and diagnosis data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI): http://adni.loni.usc.edu/

The data contains Alzheimer’s diagnoses based on MRI scans, PET scans, and professional evaluations.

The data tables that I am using are:
ADNIMERGE: Contains basic patient information and their diagnoses.
ADNI_Gene_Expression_Profile: Contains gene expression data from blood samples.

Objective

I plan to use the blood gene expression data to build a predictive model for the detection of Alzheimer’s Disease. I also intend to do some exploratory analysis to see which gene expressions are associated with Alzheimer’s.

Data Prep

Imports and Markdown Setup

R setup

library(tidyverse)
library(reticulate)
library(tblhelpr)
Sys.setenv(RETICULATE_PYTHON = "/Users/liamf/miniconda3/envs/dsci2_py38/python.exe")
use_condaenv("dsci2_py38", required = TRUE)

Python Setup

import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
import matplotlib.pyplot as plt
import seaborn as sns

Reading and Processing Diagnosis Data

# Some of this code from previous work.
# Pipe for wrangling Diagnosis data.
diagnosis <- read_csv(
  "/Users/liamf/OneDrive/Documents/DataSci2/Capstone Project/data/ADNIMERGE.csv"
  ) %>%
  mutate(RID = as.integer(RID)) %>%
  mutate(study_enrolled = case_when(
    RID < 2000               ~ "ADNI1",
    RID >= 2000 & RID < 4000 ~ "ADNIGO",
    RID >= 4000 & RID < 6000 ~ "ADNI2",
    RID >= 6000              ~ "ADNI3")
  ) %>%
  mutate(baseline_diagnosis = recode(DX_bl,
                                     EMCI = "MCI",
                                     LMCI = "MCI",
                                     SMC = "CN")) %>%
  rename(VISCODE2 = VISCODE) %>%
  filter(VISCODE2 == "bl") %>%
  select(RID, baseline_diagnosis, study_enrolled, VISCODE2)

diagnosis
## # A tibble: 2,380 x 4
##      RID baseline_diagnosis study_enrolled VISCODE2
##    <int> <chr>              <chr>          <chr>   
##  1     2 CN                 ADNI1          bl      
##  2     3 AD                 ADNI1          bl      
##  3     4 MCI                ADNI1          bl      
##  4     5 CN                 ADNI1          bl      
##  5     6 MCI                ADNI1          bl      
##  6     7 AD                 ADNI1          bl      
##  7    10 AD                 ADNI1          bl      
##  8    14 CN                 ADNI1          bl      
##  9    15 CN                 ADNI1          bl      
## 10    16 CN                 ADNI1          bl      
## # ... with 2,370 more rows
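The label collapsing in the `recode()` step above (EMCI/LMCI folded into MCI, SMC into CN) can be sketched equivalently in pandas with `replace`; the example labels here are hypothetical:

```python
import pandas as pd

# Hypothetical baseline diagnosis labels before collapsing.
dx = pd.Series(["CN", "EMCI", "LMCI", "SMC", "AD"])

# Fold the early/late MCI subtypes into MCI, and subjective memory
# complaints (SMC) into cognitively normal (CN).
collapsed = dx.replace({"EMCI": "MCI", "LMCI": "MCI", "SMC": "CN"})
print(collapsed.tolist())  # ['CN', 'MCI', 'MCI', 'CN', 'AD']
```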

Missing Data

# Counting the number of missing values in each column.
diagnosis %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  rowid_to_column() %>%
  transpose_tibble(rowid)
## # A tibble: 4 x 2
##   columns              `1`
##   <chr>              <int>
## 1 RID                    0
## 2 baseline_diagnosis    28
## 3 study_enrolled         0
## 4 VISCODE2               0

The 28 missing baseline diagnoses are missing in the source data itself; they were not introduced by the wrangling above.
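The same missing-label count (and, if desired, the removal of unlabelled rows before modelling) can be sketched in pandas; the small frame below is hypothetical:

```python
import pandas as pd

# Hypothetical diagnosis frame with two missing baseline labels.
diagnosis = pd.DataFrame({
    "RID": [2, 3, 4, 5, 6],
    "baseline_diagnosis": ["CN", "AD", None, "MCI", None],
})

# Count missing labels, then keep only the labelled rows.
n_missing = diagnosis["baseline_diagnosis"].isna().sum()
labelled = diagnosis.dropna(subset=["baseline_diagnosis"])
print(n_missing, len(labelled))  # 2 missing; 3 labelled rows remain
```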

RID (patient) Key

diagnosis %>%
  count(RID) %>%
  pull(n) %>%
  max()
## [1] 1

Each RID (patient ID) occurs no more than once, so RID correctly serves as a key.
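An equivalent key check in pandas (a sketch, not part of the original pipeline) simply asks whether any RID is duplicated:

```python
import pandas as pd

# Hypothetical frame keyed by RID.
diagnosis = pd.DataFrame({
    "RID": [2, 3, 4, 5],
    "baseline_diagnosis": ["CN", "AD", "MCI", "CN"],
})

# A column is a valid key when no value repeats.
assert not diagnosis["RID"].duplicated().any()
# Equivalently, the largest group size is 1.
assert diagnosis.groupby("RID").size().max() == 1
```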

Reading and Processing Genetic Data

# Data Wrangling pipe for genetic data.
ADNI1GO2_genes <- read_csv("/Users/liamf/OneDrive/Documents/DataSci2/Capstone Project/data/ADNI_Gene_Expression_Profile.csv") %>%
  select(-"...2", -"...3") %>%
  transpose_tibble(Phase, id_col = "former_columns") %>%
  mutate(RID = str_replace(SubjectID, ".+?(_S_)", "")) %>%
  mutate(RID = as.integer(RID)) %>%
  select(RID, everything()) %>%
  select(-Visit, -former_columns, -SubjectID,
         -"260/280", -"260/230", -"RIN", -"Affy Plate",
         -"YearofCollection", -"ProbeSet") %>%
  drop_na()

ADNI1GO2_genes
## # A tibble: 744 x 49,387
##      RID `11715100_at` `11715101_s_at` `11715102_x_at` `11715103_x_at`
##    <int> <chr>         <chr>           <chr>           <chr>          
##  1  1249 2.237         2.624           1.873           2.92           
##  2  4410 2.294         2.416           1.884           2.668          
##  3  4153 2.14          2.322           1.999           3.634          
##  4  1232 2.062         2.5             1.851           3.632          
##  5  4205 2.04          2.395           2.08            3.278          
##  6  4467 2.439         2.309           1.997           3.578          
##  7   205 1.955         2.451           1.539           3.362          
##  8  2374 2.372         2.403           1.926           3.371          
##  9  4491 2.327         2.738           1.922           4.124          
## 10  4059 2.535         2.478           2.073           3.824          
## # ... with 734 more rows, and 49,382 more variables: `11715104_s_at` <chr>,
## #   `11715105_at` <chr>, `11715106_x_at` <chr>, `11715107_s_at` <chr>,
## #   `11715108_x_at` <chr>, `11715109_at` <chr>, `11715110_at` <chr>,
## #   `11715111_s_at` <chr>, `11715112_at` <chr>, `11715113_x_at` <chr>,
## #   `11715114_x_at` <chr>, `11715115_s_at` <chr>, `11715116_s_at` <chr>,
## #   `11715117_x_at` <chr>, `11715118_s_at` <chr>, `11715119_s_at` <chr>,
## #   `11715120_s_at` <chr>, `11715121_s_at` <chr>, `11715122_at` <chr>, ...
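The `str_replace` in the pipe above deletes everything up to and including the `_S_` separator, leaving the numeric RID. The same idea in Python, assuming SubjectIDs of the (hypothetical) form `XXX_S_NNNN`:

```python
import re

def extract_rid(subject_id: str) -> int:
    # Non-greedy match up to and including "_S_", then strip it off.
    return int(re.sub(r".+?_S_", "", subject_id))

print(extract_rid("116_S_1249"))  # -> 1249
```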

Key Check

ADNI1GO2_genes %>%
  count(RID) %>%
  pull(n) %>%
  max()
## [1] 1

Each RID (patient ID) occurs no more than once, so RID correctly serves as a key.

Writing Prepped Data to CSVs

write_csv(ADNI1GO2_genes, "/Users/liamf/OneDrive/Documents/DataSci2/Capstone Project/data/genes_prepped.csv")

Joining Diagnosis Table to Gene Expression and Writing to CSV

data_prepped <- diagnosis %>%
  inner_join(ADNI1GO2_genes, by = "RID")

write_csv(data_prepped, "/Users/liamf/OneDrive/Documents/DataSci2/Capstone Project/data/data_prepped.csv")

Summary

The resulting table after the data wrangling has the following characteristics:

N Row: 744, corresponding to 744 people.
N Col: 49,390; most of these are blood gene expression values for various probes.
Classes: There are three target classes: CN (cognitively normal), MCI (mild cognitive impairment), and AD (Alzheimer’s Disease).

Data Visualization and Prep for Modelling in SKLearn

data_prepped %>%
  ggplot(aes(baseline_diagnosis)) +
    geom_bar(fill = "#1c0080") +
    geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
    theme_minimal() +
    xlab("Diagnosis") + ylab("Number of People")

data_prepped %>%
  rename("gene" = "11715100_at") %>%
  mutate("Example_Gene_Expression" = as.double(gene)) %>%
  select("Example_Gene_Expression") %>%
  ggplot(aes(Example_Gene_Expression)) +
    geom_histogram(fill = "#1c0080") +
    theme_minimal() +
    xlab("Example Gene Expression") + ylab("Frequency")

Due to the small number of patients diagnosed with Alzheimer’s in the data, a StratifiedShuffleSplit will be used to split the data into training and testing sets.

Reading Data Into Pandas and Splitting

pandas_data_prepped = pd.read_csv('/Users/liamf/OneDrive/Documents/DataSci2/Capstone Project/data/data_prepped.csv')

Checking Data Types

print(pandas_data_prepped.iloc[: , :8].dtypes)
## RID                     int64
## baseline_diagnosis     object
## study_enrolled         object
## VISCODE2               object
## 11715100_at           float64
## 11715101_s_at         float64
## 11715102_x_at         float64
## 11715103_x_at         float64
## dtype: object
# Performing a Stratified Shuffle Split on the diagnosis variable.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
for train_index, test_index in split.split(pandas_data_prepped, pandas_data_prepped["baseline_diagnosis"]):
    strat_train_set = pandas_data_prepped.iloc[train_index]
    strat_test_set = pandas_data_prepped.iloc[test_index]

print(strat_test_set["baseline_diagnosis"].value_counts())
## MCI    110
## CN      65
## AD      11
## Name: baseline_diagnosis, dtype: int64
strat_train_set.to_csv('/Users/liamf/OneDrive/Documents/DataSci2/Capstone Project/data/train_set.csv', index=False)
strat_test_set.to_csv('/Users/liamf/OneDrive/Documents/DataSci2/Capstone Project/data/test_set.csv', index=False)
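A quick way to confirm that a stratified split preserves class proportions is to compare normalised value counts between the two sets. Below is a pandas-only sketch of the same idea on a hypothetical label column (sklearn's StratifiedShuffleSplit adds shuffling on top of this):

```python
import pandas as pd

# Hypothetical imbalanced label column: 80 MCI, 48 CN, 8 AD.
df = pd.DataFrame({"baseline_diagnosis": ["MCI"] * 80 + ["CN"] * 48 + ["AD"] * 8})

# Pandas-only stratified split: sample 25% within each class.
test = df.groupby("baseline_diagnosis").sample(frac=0.25, random_state=42)
train = df.drop(test.index)

# Class proportions should match across the two sets.
print(train["baseline_diagnosis"].value_counts(normalize=True))
print(test["baseline_diagnosis"].value_counts(normalize=True))
```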

Data Visualization

# pairplot
pair1 = sns.pairplot(data=strat_train_set.iloc[:,5:10])
plt.show()

plt.clf()

# pairplot with diagnosis differentiation
pair2 = sns.pairplot(data=strat_train_set.iloc[:,[1,5,6,7,8,9,10]], hue = "baseline_diagnosis")
plt.show()

plt.clf()
# dataprep for correlation heatmap
df_sub = strat_train_set.iloc[:,4: ].sample(n=10, axis='columns', random_state=42)
cor = df_sub.corr()
# Triangle correlation heatmap
mask = np.triu(np.ones_like(cor, dtype=bool))
heatmap = sns.heatmap(cor, mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
plt.show()

plt.clf()

Considerations for Part 2 (Modelling)

  • The number of columns far exceeds the number of rows, which will have to be accounted for when choosing models.
  • Variable selection is key: 49,390 variables is far too many for 744 rows. PCA, ANOVA-based filtering, and models with built-in variable selection are all options.
  • Overfitting will also be a major concern. The trained models will have to account for this, and the hyperparameters will likely need to apply substantial regularization.
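As a sketch of the variable-reduction step, principal components can be computed with numpy's SVD on a hypothetical matrix with far more columns than rows (the shapes and component count here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(40, 500))   # 40 samples, 500 features: p >> n

# Centre the columns, then take the top-k right singular vectors
# as the principal component directions.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 5
scores = Xc @ Vt[:k].T           # projection onto the first 5 components

print(scores.shape)  # (40, 5)
```

A model would then be trained on `scores` instead of the raw feature matrix, trading some interpretability for a dimensionality the 744-row dataset can support.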